HMAX is a well-known computational model of visual 
recognition in cortex consisting of just two computa- 
tional operations - a "template match" and non-linear 
pooling - alternating in a feedforward hierarchy in which 
receptive fields exhibit increasing specificity and invar- 
iance [1]. Interestingly, auditory recognition problems 
(such as speech recognition) share similar computational 
requirements, and recent work in auditory neuroscience 
suggests that auditory and visual cortex share similar 
anatomical and functional organization. Based on these 
similarities, we tested whether HMAX could support an 
auditory recognition task (specifically, word spotting). 
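
The two alternating operations can be illustrated with a minimal NumPy sketch. It is not the model of [1] or [4]: the Gaussian tuning function, the non-overlapping max pooling, and the names `s_layer`, `c_layer`, `sigma`, and `pool` are assumptions chosen only for illustration.

```python
import numpy as np

def s_layer(image, templates, sigma=1.0):
    """Template match: Gaussian-tuned response of every image patch to every template.

    image: 2-D array (H, W); templates: 3-D array (n_templates, th, tw).
    Gaussian tuning is an assumed choice of template-match function.
    """
    n_templates, th, tw = templates.shape
    H, W = image.shape
    out = np.zeros((n_templates, H - th + 1, W - tw + 1))
    for i in range(H - th + 1):
        for j in range(W - tw + 1):
            patch = image[i:i + th, j:j + tw]
            # Squared Euclidean distance between this patch and each template
            d2 = ((templates - patch) ** 2).sum(axis=(1, 2))
            out[:, i, j] = np.exp(-d2 / (2.0 * sigma ** 2))
    return out

def c_layer(s_maps, pool=2):
    """Non-linear pooling: max over non-overlapping pool x pool neighborhoods."""
    n, H, W = s_maps.shape
    H2, W2 = H // pool, W // pool
    # Drop any rows/columns that do not fill a complete pooling window
    trimmed = s_maps[:, :H2 * pool, :W2 * pool]
    return trimmed.reshape(n, H2, pool, W2, pool).max(axis=(2, 4))
```

Stacking such S and C stages yields the feedforward hierarchy: each S stage increases specificity by matching stored templates, and each C stage increases invariance by keeping only the strongest response in a neighborhood.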

To test HMAX on word spotting, recorded speech 
samples from the TIMIT corpus [2] were first converted 
into time-frequency spectrograms using a computational 
model of the auditory periphery [3]. These spectrograms 
were then split into 750 ms frames and input to a stan- 
dard HMAX model [4]. Based on observed similarities 
between the receptive fields in primary auditory cortex 
(spectro-temporal receptive fields, or STRFs) and primary 
visual cortex (typically modeled as oriented Gabor filters), 
we used S1 filters identical to those used in vision [4]. 
Similarly, S2 "patches" were randomly selected from C1 
representations of speech sounds drawn from an inde- 
pendent speech corpus. One vs. all linear support vector 
machines (SVMs) were then trained to discriminate 
frames that contained a target word from those that did 
not. These SVMs were then tested on a novel set of test 
sentences using a sliding frame approach (750 ms frame 
size, 20 ms step size). For each frame in a sentence, the 
SVM produced a distance from the hyperplane, and a 
threshold value was applied to produce a binary classification 
of whether or not the target word was present in 
the sentence. When tested on target words that appeared 
in a fixed context (i.e., SA sentences in TIMIT), 
performance was highly robust, with ROC areas consistently 
above 0.9. When tested on target words that 
appeared in variable contexts (i.e., SI sentences in 
TIMIT), performance decreased somewhat, with 
ROC areas around 0.8. This decrease in performance 
is likely due to the inclusion of "clutter" (i.e., target-irrelevant 
features) within the frame, an effect also commonly observed 
when HMAX is applied to visual object recognition 
tasks [1]. 
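
The sliding-frame test procedure can be sketched as follows. Only the 750 ms frame size and 20 ms step size come from the text. The spectrogram time resolution (`col_ms`), the `featurize` callable standing in for the HMAX feature extraction, and the use of the maximum frame score as the sentence-level decision value are assumptions made for illustration.

```python
import numpy as np

FRAME_MS = 750  # frame size stated in the text
STEP_MS = 20    # step size stated in the text

def frame_starts(n_cols, col_ms, frame_ms=FRAME_MS, step_ms=STEP_MS):
    """Start columns of all full frames in a spectrogram with n_cols time columns.

    col_ms is the duration of one spectrogram column (an assumed parameter).
    """
    frame_cols = int(round(frame_ms / col_ms))
    step_cols = max(1, int(round(step_ms / col_ms)))
    if n_cols < frame_cols:
        return [], frame_cols
    return list(range(0, n_cols - frame_cols + 1, step_cols)), frame_cols

def spot_word(spectrogram, featurize, w, b, threshold, col_ms=10.0):
    """Score every frame with a linear SVM and threshold the best score.

    featurize maps a frame to a feature vector (a placeholder for HMAX).
    w @ x + b is the signed decision value, which is proportional to the
    distance from the hyperplane. Using the maximum over frames as the
    sentence-level score is an assumption, not stated in the source.
    """
    starts, frame_cols = frame_starts(spectrogram.shape[1], col_ms)
    scores = np.array([
        featurize(spectrogram[:, s:s + frame_cols]) @ w + b
        for s in starts
    ])
    detected = bool(scores.size and scores.max() > threshold)
    return detected, scores
```

Sweeping `threshold` over the range of sentence-level scores and recording hit and false-alarm rates at each value would trace out an ROC curve of the kind summarized above.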

These results are novel in that they provide support for 
the hypothesis that the simple computational framework 
implemented in HMAX - consisting of a feedforward 
hierarchy of only two alternating computational opera- 
tions - may generalize beyond vision to support auditory 
recognition as well. It is possible that such a representa- 
tion could give rise to stable neural encodings that are 
invariant to behaviorally irrelevant characteristics as seen 
in higher order visual and auditory cortices [5,6]. While 
it is likely that this auditory version of the HMAX model 
would benefit from the use of more auditory-specific fil- 
ters based on STRF models [7], the Gabor features used 
here are largely compatible with previous computational 
models based on STRFs up to the level of primary audi- 
tory cortex [8]. Additional benefit may also be gained by 
learning sparse representations from natural sounds, at 
both the S1 and S2 levels [9]. 